Tests: move generate tests to the right mixin and delete redundant tests #34464

gante merged 17 commits into huggingface:main

Conversation
# Exception 1: when passing input_embeds, input_ids may be missing entries
# Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here
- # Exception 3: with synced GPUs cache_position may go out of bounds, but we only want dummy token in that case
+ # Exception 3: with synced GPUs cache_position may go out of bounds, but we only want dummy token in that case.
(changes in this file were done for end-to-end compilation)
def _check_similar_generate_outputs(self, output_1, output_2, atol=1e-5, rtol=1e-5):
    """
    Checks whether a pair of generate outputs are similar. Two `generate` call outputs are considered similar in
    the following situations:
    1. The sequences are the same
    2. The sequences are different, but the scores up until the first mismatch are nearly identical
    """
Many generate tests wanted to check equivalence of something, and were incorrectly tagged as flaky. This function contains the correct equivalence check -- as a result, several @is_flaky were removed :D
(I honestly don't know why I haven't implemented this before, instead of adding @is_flaky 🙃 )
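For reference, the check described above boils down to something like the sketch below. It is illustrative, not the literal mixin code, and assumes both outputs were produced with `return_dict_in_generate=True` and `output_scores=True` so they expose `.sequences` and `.scores`.

```python
import torch


def check_similar_generate_outputs(output_1, output_2, atol=1e-5, rtol=1e-5):
    # Case 1: identical sequences -> similar by definition.
    if torch.equal(output_1.sequences, output_2.sequences):
        return

    # Case 2: sequences differ, but the per-step scores up to (and including) the first
    # mismatching step must be near-identical -- tiny FP differences can flip an argmax
    # and change every subsequent token.
    num_new_tokens = min(len(output_1.scores), len(output_2.scores))
    prompt_len_1 = output_1.sequences.shape[1] - len(output_1.scores)
    prompt_len_2 = output_2.sequences.shape[1] - len(output_2.scores)
    for step in range(num_new_tokens):
        torch.testing.assert_close(output_1.scores[step], output_2.scores[step], atol=atol, rtol=rtol)
        token_1 = output_1.sequences[:, prompt_len_1 + step]
        token_2 = output_2.sequences[:, prompt_len_2 + step]
        if not torch.equal(token_1, token_2):
            break  # first mismatch reached -- everything up to here was already checked
```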
self.assertIsInstance(output_generate, GreedySearchDecoderOnlyOutput)

- self._check_outputs(output_generate, main_input, model.config)
+ self._check_outputs(output_generate, model.config)
We actually don't need main_input -- we can obtain the batch size directly from the outputs of generate
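For example (a sketch; assumes `return_dict_in_generate=True`, so the output exposes `.sequences`):

```python
# The batch size can be read off the generate output itself, so the check helper
# no longer needs the original `main_input` tensor:
batch_size = output_generate.sequences.shape[0]
```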
tests/generation/test_utils.py
def test_generate_from_inputs_embeds(self):
    """Tests that we can generate from `inputs_embeds` instead of `input_ids` in LLMs, VLMs, etc"""
Related to the deleted inputs_embeds tests -- clarifies that this test also runs for VLMs
Tests that generating with static cache gives almost the same results as with dynamic cache, and the output cache
has the expected shapes
Merges this existing test, which checked the static cache returned by generate, with the deleted static cache equivalence test.
One generate call is enough to run both checks :)
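Roughly, the merged test does something like the sketch below (an illustrative helper: it assumes the model supports `cache_implementation="static"` and that the returned StaticCache exposes per-layer key tensors of shape `(batch, num_heads, max_cache_len, head_dim)`).

```python
import torch


def check_static_cache_generate(model, inputs, max_new_tokens=10):
    common_kwargs = {
        "max_new_tokens": max_new_tokens,
        "do_sample": False,
        "return_dict_in_generate": True,
        "output_scores": True,
    }
    dynamic_out = model.generate(**inputs, **common_kwargs)
    static_out = model.generate(**inputs, **common_kwargs, cache_implementation="static")

    # Check 1 (from the deleted equivalence test): static cache results match dynamic cache results.
    # The real test would rely on the similarity helper described above rather than strict equality.
    torch.testing.assert_close(static_out.sequences, dynamic_out.sequences)

    # Check 2 (from the pre-existing test): the returned static cache is pre-allocated up to
    # `prompt length + max_new_tokens`.
    max_cache_len = inputs["input_ids"].shape[1] + max_new_tokens
    for layer_keys in static_out.past_key_values.key_cache:
        assert layer_keys.shape[2] == max_cache_len
```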
tests/generation/test_utils.py
| ("end_to_end", True), # TODO (@joao): end-to-end compilation is broken with torch 2.5+, explore and fix | ||
| ] | ||
| ) | ||
| def test_generate_compile(self, name, end_to_end): |
Instead of having a test for torch.compile against large checkpoints, let's do it against dummy checkpoints -- the importance of the test is to confirm a) we can compile AND b) compiled is equivalent to non-compiled.
Testing model.forward compilation is very similar to testing model.generate compilation, so parameterized is used. A few modifications were added to make the test more efficient (e.g. compile only once)
Note: this test doesn't have to be slow, but we currently have a few failures. Let's fix them in a follow-up PR
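The structure is roughly as follows. This is a condensed sketch: `_get_model_and_inputs` is a hypothetical helper, and the real test has extra setup (e.g. static cache configuration for the compiled runs).

```python
import unittest

import torch
from parameterized import parameterized


class GenerateCompileSketch(unittest.TestCase):
    @parameterized.expand(
        [
            ("forward_only", False),  # compile model.forward, keep the generation loop eager
            ("end_to_end", True),     # compile the whole generate call
        ]
    )
    def test_generate_compile(self, _, end_to_end):
        model, inputs = self._get_model_and_inputs()  # hypothetical helper
        generation_kwargs = {"max_new_tokens": 10, "do_sample": False}

        # a) reference run without compilation
        eager_sequences = model.generate(**inputs, **generation_kwargs)

        # b) compiled run -- compile once, then reuse the compiled callable
        if end_to_end:
            compiled_generate = torch.compile(model.generate)
            compiled_sequences = compiled_generate(**inputs, **generation_kwargs)
        else:
            model.forward = torch.compile(model.forward)
            compiled_sequences = model.generate(**inputs, **generation_kwargs)

        # the test's core claims: it compiles, and compiling does not change the generated tokens
        torch.testing.assert_close(compiled_sequences, eager_sequences)
```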
@require_torch_sdpa
@slow
def test_eager_matches_sdpa_generate(self):
The test to check that SDPA = eager is the same as the test to check that FA2 = eager. However, parameterized doesn't work here: we want different pytest.mark decorators for each test. As such, the original test was moved to a helper function.
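The pattern looks roughly like this (a sketch: the helper name, checkpoint, and body are illustrative; only the two test names and the decorator sets follow the comment above).

```python
import torch
from pytest import mark

from transformers import AutoModelForCausalLM
from transformers.testing_utils import require_flash_attn, require_torch_gpu, require_torch_sdpa, slow


class AttnEquivalenceSketch:
    def _attn_implementation_matches_eager_generate(self, attn_implementation):
        checkpoint = "hf-internal-testing/tiny-random-LlamaForCausalLM"  # any small causal LM works here
        inputs = {"input_ids": torch.tensor([[1, 2, 3, 4]], device="cuda")}

        model_eager = AutoModelForCausalLM.from_pretrained(
            checkpoint, attn_implementation="eager", torch_dtype=torch.bfloat16
        ).to("cuda")
        model_other = AutoModelForCausalLM.from_pretrained(
            checkpoint, attn_implementation=attn_implementation, torch_dtype=torch.bfloat16
        ).to("cuda")

        out_eager = model_eager.generate(**inputs, max_new_tokens=5, do_sample=False)
        out_other = model_other.generate(**inputs, max_new_tokens=5, do_sample=False)
        torch.testing.assert_close(out_eager, out_other)  # the real test uses the similarity helper

    # Same body, different markers -- which is why `parameterized` can't be used here.
    @require_torch_sdpa
    @slow
    def test_eager_matches_sdpa_generate(self):
        self._attn_implementation_matches_eager_generate("sdpa")

    @require_flash_attn
    @require_torch_gpu
    @mark.flash_attn_test
    @slow
    def test_eager_matches_fa2_generate(self):
        self._attn_implementation_matches_eager_generate("flash_attention_2")
```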
)[0]
self.assertTrue(torch.allclose(out_embeds, out_ids))

def test_inputs_embeds_matches_input_ids_with_generate(self):
We already had a test that checks that generating from inputs_embeds is equivalent to generating from input_ids -- test_generate_from_inputs_embeds
This test was deleted.
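For context, the equivalence that test_generate_from_inputs_embeds covers is roughly the following (a sketch, decoder-only case):

```python
import torch


def check_generate_from_inputs_embeds(model, input_ids):
    inputs_embeds = model.get_input_embeddings()(input_ids)

    from_ids = model.generate(input_ids, max_new_tokens=5, do_sample=False)
    from_embeds = model.generate(inputs_embeds=inputs_embeds, max_new_tokens=5, do_sample=False)

    # When generating from embeddings there are no prompt ids to prepend, so the output only
    # contains the new tokens -- compare it against the continuation of the input_ids run.
    torch.testing.assert_close(from_ids[:, input_ids.shape[1]:], from_embeds)
```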
one q: afaik the other test is skipped for encoder-decoder models? Can we enable it just to be sure we cover all models with generation abilities?
Good point, thank you for catching that! I'll update test_generate_from_inputs_embeds so as to support VLMs
✅ test_generate_from_inputs_embeds updated for VLMs, overwrites deleted. @zucchini-nlp have a look at the updated test -- you might need to add VLMs to the list of added special cases in the test
@mark.flash_attn_test
@slow
@is_flaky()
def test_flash_attn_2_generate_left_padding(self):
We have:
- a test that checks that we can generate with left-padding (test_left_padding_compatibility)
- a test that checks that we can generate with FA2 (test_eager_matches_fa2_generate, added in this PR)
- a test that checks that we can run the forward pass in FA2 with left-padding (test_flash_attn_2_inference_equivalence)

As such, this test was redundant. From generate's point of view, FA2 is a modification to model.forward, so if model.forward is equivalent then so is model.generate.
Deleted.
@mark.flash_attn_test
@is_flaky()
@slow
def test_flash_attn_2_generate_padding_right(self):
Similar to the above (we have individual checks in place). Moreover, generating with right-padding doesn't make sense for most models :D
Deleted.
@require_torch_gpu
@mark.flash_attn_test
@slow
def test_flash_attn_2_generate_use_cache(self):
This test was pretty toothless -- it didn't even check the actual output of generate.
I've added test_eager_matches_fa2_generate in its place, which checks the outputs. It's a clone of the SDPA generate equivalence test, except that it uses FA2.
@require_torch_gpu
@mark.flash_attn_test
@slow
def test_flash_attn_2_generate_reuse_cache(self):
Similar to the FA2 + padding tests above.
In this case, we check that we can reuse a cache in test_generate_continue_from_past_key_values, no need for a FA2-specific test.
Deleted.
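For reference, the cache-reuse pattern exercised by test_generate_continue_from_past_key_values looks roughly like this (a sketch, not the actual test body):

```python
import torch


def check_cache_reuse(model, inputs):
    # 1) generate a first chunk and keep the returned cache
    first = model.generate(**inputs, max_new_tokens=3, do_sample=False, return_dict_in_generate=True)

    # 2) continue generation, feeding back the sequence so far together with the cached key/values
    continued = model.generate(
        first.sequences, past_key_values=first.past_key_values, max_new_tokens=3, do_sample=False
    )

    # 3) the continued run must match one uninterrupted run of the same total length
    full = model.generate(**inputs, max_new_tokens=6, do_sample=False)
    torch.testing.assert_close(continued, full)
```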
normalized_1 = F.softmax(out_shared_prefix_last_tokens)
torch.testing.assert_close(normalized_0, normalized_1, rtol=1e-3, atol=1e-4)

def test_static_cache_matches_dynamic(self):
Merged the checks in this test [check that the outputs are the same] with the checks in test_generate_with_static_cache [check that the shapes are as expected]
@slow
@require_torch_accelerator
@require_read_token
def test_torch_compile(self):
This test was running against very large checkpoints, and we had to manually specify one in _torch_compile_test_ckpt 👀
We were indeed missing a mixin test for generate with model.forward compiled -- see the new test_generate_compile. The most important part of this test is to confirm that compiled == not compiled; we have checks for the other parts of the problem elsewhere (including checks that large checkpoints work as expected).
This heavy test was, in practice, deleted.
@slow
@require_torch_gpu  # Testing cuda graphs.
@require_read_token
def test_compile_cuda_graph_time(self):
Same as above (uses heavy checkpoints).
We now have a benchmarks page, which is a much better way to track performance than a test.
Great clean up, good to see redundant code killed! Left a few questions and also I think we had test_embeds_match_ids_generate overwritten in all VLMs so prob those can be removed now
Overall LGTM, thanks
EDIT: forgot to mention that a few models might not be running the new tests because they don't have GenerationTesterMixin. Working on adding that on a few VLMs currently and noticed that tests from here might fail for them. Guess it's okay, i'll skip/fix the failing ones after enabling generation tests
# Generate with one batch only to test generation when attention mask will be None
# when real inputs are used, because there is no padding. See issue #32237 for more
dummy_input = dummy_input[:1, ...]
dummy_attention_mask = torch.ones_like(dummy_attention_mask[:1, ...])
_ = model.generate(
    dummy_input,
    attention_mask=dummy_attention_mask,
    max_new_tokens=max_new_tokens,
    do_sample=False,
    use_cache=True,
)
This was testing that FA2 with packing works in generate, and unfortunately we don't have an equivalent forward test. We could extend the test_fa2_position_ids test with this specific case.
Makes sense, I'll make sure there is one packing FA2 test (testing the forward pass, and not generate -- generate does not support packing)
@zucchini-nlp actually test_flash_attn_2_fp32_ln covers the case without attention mask, no need to add more test cases
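For context, "packing" here means concatenating several sequences into a single row with no padding, with the boundaries signalled via position_ids that restart at 0. An illustrative example (the values are made up):

```python
import torch

# Two sequences, [101, 102, 103] and [201, 202], packed into one row with no padding.
packed_input_ids = torch.tensor([[101, 102, 103, 201, 202]])
# position_ids restart at 0 at each sequence boundary; the FA2 integration can use these
# restarts to recover per-sequence lengths and run variable-length attention, so tokens
# don't attend across sequence boundaries.
packed_position_ids = torch.tensor([[0, 1, 2, 0, 1]])

# A forward-pass check (not generate) would compare
#   model(packed_input_ids, position_ids=packed_position_ids)
# against running the two sequences separately.
```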
@slow
@is_flaky()  # compilation may result in equivalent (!= same) FP ops, causing the argmax in `generate` to be flaky
- def test_generate_compile_fullgraph(self):
+ def test_generate_compile(self, _, end_to_end):
RUN_SLOW=1 python3 -m pytest -v tests/models/llama/test_modeling_llama.py -k "test_generate_compile"
gives me
tests/models/llama/test_modeling_llama.py::LlamaModelTest::test_generate_compile_1_end_to_end - RuntimeError: Error: accessing tensor output of CUDAGraphs that has been overwritten by a subsequent run. Stack trace: File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/external_utils.py", line 38, ...
@ydshieh have a look at the TODO in the @parameterized of this test -- both parameterizations have broken cases, whose cause comes from outside this PR.
This PR is already quite big, I'd rather fix the root causes of both parameterizations in a follow-up PR 🤗 This PR moves the tests to the right place, and doesn't touch modeling code
ArthurZucker left a comment
Long overdue! Thanks everyone for reviewing and @gante for the cleanup!
…tests (huggingface#34464)

* tmp commit
* tmp commit
* cull overwrites of deleted tests
* typo
* more specific docstring
* make fixup
* parameterize at the top?
* correction
* more deletions :D
* tmp commit
* for VLMs too
* fix _check_outputs
* test nit
* make fixup
* fix another flaky
* test_generate_from_inputs_embeds -- handle missing attention mask
Moves generate tests incorrectly placed in the general mixin to GenerationTesterMixin. In the process, removes redundant tests and streamlines repeated logic.

👀 test_modeling_utils.py, the last file in the diff: in it, I explain what happened to each test and why -- most were actually deleted, as they were redundant.

After this PR:
✅ fewer redundant tests and less code duplication
✅ fewer flaky tests
✅ faster tests
✅ more test coverage

In a follow-up PR:
👉 Fix failing upgraded generate+FA2 test
👉 Fix failing added generate+torch.compile test

Closes #32913